1. A computer assisted method of auditing a superset of training data, the
superset comprising examples of documents having one or more category 
assignments, the method including: 

partitioning the superset into at least two disjoint sets, including a test set and a 
training set, wherein the test set includes one or more test documents and the 
training set includes examples of documents belonging to at least two
categories; 

categorizing the test documents using the training set; 

calculating a metric of confidence based on results of the categorizing step and 
the category assignments for the test documents; and 

reporting the test documents and category assignments that are suspicious and 
that appear to be missing, based on the metric of confidence. 
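
By way of illustration only, a single pass of the partitioning, categorizing, calculating and reporting steps might be sketched as follows. The token-set (Jaccard) similarity, the neighbor count `k`, the thresholds `low` and `high`, and the name `audit_once` are assumptions introduced for this sketch and are not recited in the claims.

```python
def jaccard(a, b):
    """Token-set similarity; an assumed stand-in for any similarity measure."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def audit_once(superset, test_ids, k=3, low=0.2, high=0.6):
    """superset maps doc_id -> (tokens, set_of_assigned_categories).

    Returns (doc_id, category, kind, confidence) tuples, where kind is
    "suspicious" (assigned, but weakly supported by the neighbors) or
    "missing" (not assigned, but strongly supported by the neighbors).
    """
    # Partition: the test set and the training set are disjoint.
    training = {i: v for i, v in superset.items() if i not in test_ids}
    categories = set().union(*(cats for _, cats in training.values()))
    findings = []
    for doc_id in sorted(test_ids):
        tokens, assigned = superset[doc_id]
        # Categorize the test document using only the training set.
        neighbors = sorted(training.values(),
                           key=lambda v: jaccard(tokens, v[0]),
                           reverse=True)[:k]
        total = sum(jaccard(tokens, t) for t, _ in neighbors)
        for cat in sorted(categories | assigned):
            support = sum(jaccard(tokens, t) for t, cats in neighbors if cat in cats)
            confidence = support / total if total else 0.0
            # Compare the confidence with the existing category assignments.
            if cat in assigned and confidence <= low:
                findings.append((doc_id, cat, "suspicious", confidence))
            elif cat not in assigned and confidence >= high:
                findings.append((doc_id, cat, "missing", confidence))
    return findings
```

In this sketch a document labeled "vehicles" whose nearest training documents are all labeled "animals" would be reported twice: once with a suspicious "vehicles" assignment and once with a missing "animals" assignment.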

2. The method of claim 1, further including repeating the partitioning, 
categorizing and calculating steps until at least one-half of the documents in the 
superset have been assigned to the test set. 

3. The method of claim 2, wherein the test set created in the partition step has a 
single test document. 

4. The method of claim 2, wherein the test set created in the partition step has a 
plurality of test documents. 

5. The method of claim 1, further including repeating the partitioning,
categorizing and calculating steps until substantially all of the documents in the 
superset have been assigned to the test set. 
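
As a non-limiting sketch, the repetition recited in claims 2 through 5 might be driven by a generator of disjoint test and training sets. The names `partitions`, `test_size` and `stop_fraction` are hypothetical; a `test_size` of 1 corresponds to a single test document per round, and a larger value to a plurality of test documents.

```python
def partitions(doc_ids, test_size=1, stop_fraction=1.0):
    """Yield (test, training) lists until the requested share of the
    superset has been assigned to a test set.

    stop_fraction=0.5 stops once at least one-half of the documents have
    been tested; stop_fraction=1.0 continues until all have been tested.
    """
    ids = list(doc_ids)
    tested = 0
    for start in range(0, len(ids), test_size):
        if tested >= stop_fraction * len(ids):
            break
        test = ids[start:start + test_size]
        training = ids[:start] + ids[start + test_size:]
        tested += len(test)
        yield test, training
```

Each round's test set is disjoint from its training set, and no document is placed in a test set more than once.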

6. The method of claim 1, wherein the partitioning, categorizing and calculating 
steps are carried out substantially without user intervention. 

7. The method of claim 5, wherein the partitioning, categorizing and calculating 
steps are carried out substantially without user intervention. 

8. The method of claim 1, wherein the partitioning, categorizing, calculating and 
reporting steps are carried out substantially without user intervention. 

9. The method of claim 5, wherein the partitioning, categorizing, calculating and
reporting steps are carried out substantially without user intervention.

10. The method of claim 1, wherein the categorizing step includes determining k
nearest neighbors of the test documents and the calculating step is based on a k
nearest neighbors categorization logic.

11. The method of claim 10, wherein the metric of confidence is an unweighted
measure of distance between the test document and the examples of documents
belonging to various categories.

12. The method of claim 11, where the unweighted measure includes application
of a relationship

$$Q_0(d_t, T_m) = \sum_{d \in K(d_t) \cap T_m} s(d_t, d),$$

wherein

$Q_0$ is a function of the test document represented by a feature vector $d_t$ and of
various categories $T_m$; and

$s$ is a metric of distance between the test document feature vector $d_t$ and certain
sample documents represented by feature vectors $d$, the certain sample
documents being among a set $K(d_t)$ of k nearest neighbors of the test document having
category assignments to the various categories $T_m$.
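
For illustration, the unweighted measure might be computed as sketched below: the scores of those k nearest neighbors that carry the category in question are summed. The use of cosine similarity as the metric s, the sparse dictionary representation of feature vectors, and the names `cosine` and `q0` are assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def q0(d_t, samples, category, k):
    """Sum s(d_t, d) over the k nearest neighbors of d_t assigned to category.

    samples is a list of (feature_vector, set_of_categories) pairs.
    """
    neighbors = sorted(samples, key=lambda s: cosine(d_t, s[0]), reverse=True)[:k]
    return sum(cosine(d_t, d) for d, cats in neighbors if category in cats)
```

Because the sum is not normalized, values of this measure depend on how many close neighbors the test document happens to have.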

13. The method of claim 10, wherein the metric of confidence is a weighted
measure of distance between the test document and the examples of documents
belonging to various categories, the weighted measure taking into account the density
of a neighborhood of the test document.

14. The method of claim 13, where the weighted measure includes application of
a relationship

$$Q_1(d_t, T_m) = \frac{\sum_{d_1 \in K(d_t) \cap T_m} s(d_t, d_1)}{\sum_{d_2 \in K(d_t)} s(d_t, d_2)},$$

wherein

$Q_1$ is a function of the test document represented by a feature vector $d_t$ and of
various categories $T_m$; and

$s$ is a metric of distance between the test document feature vector $d_t$ and certain
sample documents represented by feature vectors $d_1$ and $d_2$, the certain sample
documents $d_1$ being among a set $K(d_t)$ of k nearest neighbors of the test document
having category assignments to the various categories $T_m$ and the certain sample
documents $d_2$ being among a set of k nearest neighbors of the test document.
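
For illustration, the weighted measure might be computed as sketched below: the score mass of the neighbors assigned to a category is divided by the score mass of all k nearest neighbors, so that the result is scaled by the density of the neighborhood. Cosine similarity as the metric s, the sparse dictionary vectors, and the names `cosine` and `q1` are assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def q1(d_t, samples, category, k):
    """Ratio of in-category neighbor scores to all neighbor scores.

    samples is a list of (feature_vector, set_of_categories) pairs.
    Returns 0.0 when the neighborhood has no score mass at all.
    """
    neighbors = sorted(samples, key=lambda s: cosine(d_t, s[0]), reverse=True)[:k]
    in_category = sum(cosine(d_t, d1) for d1, cats in neighbors if category in cats)
    everything = sum(cosine(d_t, d2) for d2, _ in neighbors)
    return in_category / everything if everything else 0.0
```

With a non-negative metric the ratio lies between 0 and 1, which lets a single threshold be applied to documents in dense and sparse neighborhoods alike.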

15. The method of claim 1, wherein the identifying step further includes filtering
the test documents based on the metric of confidence.

16. The method of claim 15, wherein the filtering step further includes color
coding the identified test documents based on the metric of confidence.

17. The method of claim 15, wherein the filtering step further includes selecting
for display the identified test documents based on the metric of confidence.

18. The method of claim 1, wherein the user interface is a printed report.

19. The method of claim 1, wherein the user interface is a file conforming to
XML syntax.

20. The method of claim 1, wherein the user interface is a sorted display
identifying at least a portion of the test documents.

21. The method of claim 1, further including calculating a precision score for the
identified test documents.

22. A computer assisted method of auditing a superset of training data, the
superset comprising examples of documents having one or more category
assignments, the method including:

determining k nearest neighbors of the documents in the superset;

categorizing the documents based on the k nearest neighbors into a plurality of
categories;

calculating a metric of confidence based on results of the categorizing step and
the category assignments for the documents; and

reporting the documents and category assignments that are suspicious and that
appear to be missing, based on the metric of confidence.

23. The method of claim 22, wherein the metric of confidence is an unweighted
measure of distance between the test document and the examples of documents
belonging to various categories.

24. The method of claim 23, where the unweighted measure includes application
of a relationship

$$Q_0(d_t, T_m) = \sum_{d \in K(d_t) \cap T_m} s(d_t, d),$$

wherein

$Q_0$ is a function of the test document represented by a feature vector $d_t$ and of
various categories $T_m$; and

$s$ is a metric of distance between the test document feature vector $d_t$ and certain
sample documents represented by feature vectors $d$, the certain sample
documents being among a set $K(d_t)$ of k nearest neighbors of the test document having
category assignments to the various categories $T_m$.

25. The method of claim 22, wherein the metric of confidence is a weighted
measure of distance between the test document and the examples of documents
belonging to various categories, the weighted measure taking into account the density
of a neighborhood of the test document.

26. The method of claim 25, wherein the weighted measure includes application
of a relationship

$$Q_1(d_t, T_m) = \frac{\sum_{d_1 \in K(d_t) \cap T_m} s(d_t, d_1)}{\sum_{d_2 \in K(d_t)} s(d_t, d_2)},$$

wherein

$Q_1$ is a function of the test document represented by a feature vector $d_t$ and of
various categories $T_m$; and

$s$ is a metric of distance between the test document feature vector $d_t$ and certain
sample documents represented by feature vectors $d_1$ and $d_2$, the certain sample
documents $d_1$ being among a set $K(d_t)$ of k nearest neighbors of the test document
having category assignments to the various categories $T_m$ and the certain sample
documents $d_2$ being among a set of k nearest neighbors of the test document.

27. The method of claim 22, wherein the determining, categorizing and
calculating steps are carried out substantially without user intervention.

28. The method of claim 22, wherein the identifying step further includes
filtering the documents based on the metric of confidence.

29. The method of claim 28, wherein the filtering step further includes color
coding the identified documents based on the metric of confidence.

30. The method of claim 28, wherein the filtering step further includes selecting
for display the identified documents based on the metric of confidence.

31. The method of claim 22, wherein the user interface is a printed report.

32. The method of claim 22, wherein the user interface is a file conforming to 
XML syntax. 
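
Purely as an example, a report conforming to XML syntax might be produced with the Python standard library as sketched below. The element name `audit`, the element name `finding`, the attribute names, and the function name `findings_to_xml` are hypothetical; the claims do not prescribe any schema.

```python
import xml.etree.ElementTree as ET


def findings_to_xml(findings):
    """Serialize (doc_id, category, kind, confidence) tuples as XML text."""
    root = ET.Element("audit")  # hypothetical root element name
    for doc_id, category, kind, confidence in findings:
        ET.SubElement(root, "finding", {
            "document": doc_id,
            "category": category,
            "kind": kind,  # e.g. "suspicious" or "missing"
            "confidence": f"{confidence:.2f}",
        })
    return ET.tostring(root, encoding="unicode")
```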

33. The method of claim 22, wherein the user interface is a sorted display 
identifying at least a portion of the documents. 
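
As one possible sketch of the filtering, color coding and sorted display recited above, documents whose existing category assignments score below a cutoff might be selected, tagged with a color and listed with the least confident first. The cutoff values, the color names, and the function names `color_for` and `sorted_display` are assumptions of this sketch.

```python
def color_for(confidence):
    """Map a confidence value to a display color (assumed cutoffs)."""
    if confidence < 0.2:
        return "red"
    if confidence < 0.5:
        return "yellow"
    return "green"


def sorted_display(scored, threshold=0.5):
    """Filter and sort (doc_id, category, confidence) tuples for display.

    Keeps only assignments whose confidence is below threshold and
    returns them sorted so the least confident assignment comes first.
    """
    rows = [(doc, cat, conf, color_for(conf))
            for doc, cat, conf in scored if conf < threshold]
    return sorted(rows, key=lambda row: row[2])
```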

34. The method of claim 22, further including calculating a precision score for 
the identified documents. 